Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology

Authors

  • Walid Magdy
  • Kareem Darwish
Abstract

This paper explores the use of a character-segment-based correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experiments show that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology has a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.
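To make the abstract's two main ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes a noisy-channel formulation in which a candidate is scored by a segment-level confusion model times a unigram language model. Every name and number in it (`SEGMENT_MODEL`, `MATCH_PROB`, `FLOOR_PROB`, `LANGUAGE_MODEL`, `channel_prob`, `rank_candidates`) is hypothetical, and English words stand in for Arabic purely for readability.

```python
import math
from functools import lru_cache

# Hypothetical confusion table: P(observed segment | true segment).
# Segments may span more than one character, so "rn" read as "m" is one event.
SEGMENT_MODEL = {("rn", "m"): 0.2, ("l", "1"): 0.1}
MATCH_PROB = 0.9    # assumed probability that a single character is read correctly
FLOOR_PROB = 1e-4   # assumed probability of any unlisted one-to-one substitution
MAX_SEG = 2         # longest segment considered on either side

# Hypothetical unigram language model (word -> probability).
LANGUAGE_MODEL = {"corner": 0.6, "comer": 0.1, "modern": 0.3}


def channel_prob(true_word, observed):
    """Probability of the best segment alignment turning true_word into observed."""

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(true_word) and j == len(observed):
            return 1.0
        top = 0.0
        for a in range(1, MAX_SEG + 1):
            for b in range(1, MAX_SEG + 1):
                t = true_word[i:i + a]
                o = observed[j:j + b]
                if len(t) < a or len(o) < b:
                    continue  # segment would run past the end of a word
                if a == 1 and b == 1:
                    p = MATCH_PROB if t == o else SEGMENT_MODEL.get((t, o), FLOOR_PROB)
                else:
                    # Multi-character segments are allowed only if listed.
                    p = SEGMENT_MODEL.get((t, o), 0.0)
                if p > 0.0:
                    top = max(top, p * best(i + a, j + b))
        return top

    return best(0, 0)


def rank_candidates(observed, language_model):
    """Rank dictionary words by log P(observed | word) + log P(word)."""
    scored = []
    for word, prior in language_model.items():
        likelihood = channel_prob(word, observed)
        if likelihood > 0.0 and prior > 0.0:
            scored.append((word, math.log(likelihood) + math.log(prior)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates("comer", LANGUAGE_MODEL)
```

In this toy setup the channel model alone prefers leaving "comer" unchanged, but the language model prior moves "corner" to the top of the ranking. A single-character model could not express the "rn" to "m" confusion as one event, which is the intuition behind segment-based correction.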

Similar resources

Retrieving Arabic Printed Document: a Survey

This paper surveys some of the literature pertaining to searching and retrieving OCR’ed printed documents with emphasis on Arabic documents. It examines peculiarities of Arabic morphology, orthography, retrieval, word clustering, display, OCR, and error correction. The paper surveys existing evaluation test-beds for retrieval of Arabic OCR texts. Lastly, it concludes with possible directions fo...

Enhancing Image-based Arabic Document Translation Using a Noisy Channel Correction Model

An image-based document translation system consists of several components, among which OCR (Optical Character Recognition) plays an important role. However, existing OCR software is not robust against environmental variations. Furthermore, OCR errors are often propagated into the translation component, causing poor end-to-end performance. In this paper, we propose an image-based docume...

A Statistical Approach to Automatic OCR Error Correction in Context

This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexico...

A Survey on Various OCR Errors

Research on correcting words in OCR text has mainly centered on (1) non-word errors and (2) isolated-word error correction and context-dependent word correction. Various kinds of techniques have been developed. This paper surveys various techniques for correcting these errors and determines which techniques are better. General Terms Optical Character Recognition, Natural L...

Low-resource OCR error detection and correction in French Clinical Texts

In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. While traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training mater...

Publication date: 2006